Abstract

We introduce a simple algorithm for reconstructing phylogenies from 
multiple gene trees in the presence of incomplete lineage sorting, that is, 
when the topology of the gene trees may differ from that of the species tree. 
We show that our technique is statistically consistent under standard stochas- 
tic assumptions, that is, it returns the correct tree given sufficiently many 
unlinked loci. We also show that it can tolerate moderate estimation errors. 



1 Introduction 

Phylogenies (the evolutionary relationships of a group of species) are typically
inferred from estimated genealogical histories of one or several genes (or gene
trees) [Fel04, SS03]. Yet it is well known that such gene trees may provide
misleading information about the phylogeny (or species tree) containing them.
Indeed, it was observed early on that a gene tree may be topologically inconsistent
with its species tree, a phenomenon known as incomplete lineage sorting. See
e.g. [Mad97, Nic01, Fel04] and references therein. Such discordance plays little
role in the reconstruction of deep phylogenetic branchings but it is critical in the
study of recently diverged populations [LP02, HM03, Kno04].

Two common approaches to deal with this issue are concatenation and majority
voting. In the former, one concatenates the sequences originating from several
genes and hopes that a tree inferred from the combined data will produce a better
estimate. This approach appears to give poor results [KD07]. Alternatively, one
can infer multiple gene trees and output the most common reconstruction (that is,
take a majority vote). This is also often doomed to failure. Indeed, a recent, striking
result of Degnan and Rosenberg [DR06] shows that, under appropriate conditions,
the most likely gene tree may be inconsistent with the species tree; and this situation
may arise on any topology with at least 5 species. See also [PN88, Tak89] for
related results.

Other techniques are being explored that attempt to address incomplete lineage
sorting, notably Bayesian [ELP07] and likelihood [SR07] methods. However, the
problem is still far from being solved, as discussed in [MK06]. Here we propose a
simple technique, which we call Global LAteSt Split or GLASS, for estimating
species trees from multiple genes (or loci). Our technique develops some of the
ideas of Takahata [Tak89] and Rosenberg [Ros02], who studied the properties of
gene trees in terms of the corresponding species tree. In our main result, we show
that GLASS is statistically consistent, that is, it returns the correct topology with
probability tending to 1 as the number of (unlinked) genes grows, thereby avoiding
the pitfalls highlighted in [DR06]. We also obtain explicit convergence rates under
a standard model based on Kingman's coalescent [Kin82]. Moreover, we allow the
use of several alleles from each population and we show how our technique leads
to an extension of Rosenberg's topological concordance [Ros02] to multiple loci.

We note the recent results of Steel and Rodrigo [SR07], who showed that
Maximum Likelihood (ML) is statistically consistent under slightly different
assumptions. An advantage of GLASS over likelihood (and Bayesian) methods is its
computational efficiency, as no efficient algorithm for finding ML trees is known.
Furthermore, GLASS gives explicit convergence rates, which are useful in assessing
the quality of the reconstruction.

For more background on phylogenetic inference and coalescent theory, see
e.g. [Fel04, SS03, HSW05, Nor01, Tav04].

Organization. The rest of the paper is organized as follows. We begin in
Section 2 with a description of the basic setup. The GLASS method is introduced in
Section 3. A proof of its consistency can be found in Sections 4 and 5. We show
in Section 6 that GLASS remains consistent under moderate estimation errors.
Finally, in Section 7, we do away with the molecular clock assumption and we show
how our technique can be used in conjunction with any distance matrix method.

2 Basic Setup



We introduce our basic modelling assumptions. See e.g. [DR06].

Species tree. Consider $n$ isolated populations with a common evolutionary
history given by the species tree $S = (V, E)$ with leaf set $L$. Note that $|L| = n$. For
each branch $e$ of $S$, we denote:

• $N_e$, the (haploid) population size on $e$ (we assume that the population size
remains constant along the branch);

• $t_e$, the number of generations encountered on $e$;

• $\tau_e$, the length of $e$ in standard coalescent time units (that is, $t_e$
rescaled by the population size $N_e$);

• $\mu = \min_e \tau_e$, the shortest branch length in $S$.

The model does not allow migration between contemporaneous populations. Often
in the literature, the population sizes $\{N_e\}_{e \in E}$ are taken to be equal to a constant
$N$. Our results are valid in a more general setting.

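Gene trees. We consider $k$ loci $\mathcal{I}$. For each population $\ell$ and each locus $i$, we
sample a set of alleles $\mathcal{M}_\ell^{(i)}$. Each locus $i \in \mathcal{I}$ has a genealogical history
represented by a gene tree $\mathcal{G}^{(i)} = (\mathcal{V}^{(i)}, \mathcal{E}^{(i)})$ with leaf set
$\mathcal{L}^{(i)} = \bigcup_\ell \mathcal{M}_\ell^{(i)}$. For two leaves $a, b$ in $\mathcal{G}^{(i)}$, we let
$\mathcal{D}_{ab}^{(i)}$ be the time in number of generations to the most recent common
ancestor of $a$ and $b$ in $\mathcal{G}^{(i)}$. Following [Tak89, Ros02], we are actually
interested in interspecific coalescence times. Hence, we define, for all $r, s \in L$,

$$\mathcal{D}_{rs}^{(i)} = \min\left\{ \mathcal{D}_{ab}^{(i)} : a \in \mathcal{M}_r^{(i)},\ b \in \mathcal{M}_s^{(i)} \right\}.$$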
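
As a small illustration of the interspecific coalescence time between two populations at one locus, the following Python sketch takes the minimum over all allele pairs with one allele from each population. The function name `interspecific_time`, the dictionary layout (unordered allele pairs mapped to coalescence times) and the numbers are hypothetical choices made for this example only.

```python
from itertools import product


def interspecific_time(coal, alleles_r, alleles_s):
    """Smallest coalescence time between an allele of r and an allele of s."""
    return min(coal[frozenset((a, b))] for a, b in product(alleles_r, alleles_s))


# Hypothetical locus: population r has alleles r1, r2; population s has allele s1.
# The times are ultrametric: r1 and s1 coalesce first, r2 joins them later.
coal = {
    frozenset(("r1", "s1")): 95.0,
    frozenset(("r1", "r2")): 150.0,
    frozenset(("r2", "s1")): 150.0,
}
d_rs = interspecific_time(coal, ["r1", "r2"], ["s1"])
print(d_rs)
```

In this toy locus the two alleles of r coalesce with each other only after r1 has coalesced with s1, and the minimum over interspecific pairs still picks out the shallowest coalescence.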



Inference problem. We seek to solve the following inference problem. We are
given $k$ gene trees as above, including accurate estimates of the coalescence times
$\mathcal{D}_{ab}^{(i)}$ for all $i \in \mathcal{I}$ and $a, b \in \mathcal{L}^{(i)}$.
Our goal is to infer the species tree $S$.

Stochastic Model. In Section 4, we will first state the correctness of our
inference algorithm in terms of a combinatorial property of the gene trees. In Section 5,
we will then show that under the following standard stochastic assumptions, this
property holds for a moderate number of genes.

Namely, we will assume that each gene tree $\mathcal{G}^{(i)}$ is distributed according to a
standard coalescent process: looking backwards in time, in each branch any two
alleles coalesce at exponential rate 1 independently of all other pairs; whenever
two populations merge in the species tree, we also merge the allele sets of the
corresponding populations (that is, the coalescence proceeds on the union of both
allele sets). We further assume that the $k$ loci $\mathcal{I}$ are unlinked or, in other words, that
the gene trees $\{\mathcal{G}^{(i)}\}_{i \in \mathcal{I}}$ are mutually independent.

Under these assumptions, an inference algorithm is said to be statistically
consistent if the probability of returning an incorrect reconstruction goes to 0 as $k$
tends to $+\infty$.
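
To make the within-branch dynamics concrete, here is a minimal simulation sketch of the coalescent process restricted to a single branch. With $j$ lineages present, each of the $j(j-1)/2$ pairs coalesces at rate 1, so the next coalescence arrives at total rate $j(j-1)/2$. The function name `lineages_at_top` and its arguments are assumptions of this example, and the branch length is taken in coalescent time units.

```python
import random


def lineages_at_top(j, tau, rng):
    """Simulate how many of j lineages remain after a branch of length tau."""
    elapsed = 0.0
    while j > 1:
        # Waiting time until the next coalescence among j lineages.
        elapsed += rng.expovariate(j * (j - 1) / 2)
        if elapsed > tau:
            break  # the branch ends before the next coalescence
        j -= 1
    return j


rng = random.Random(1)
counts = [lineages_at_top(10, 2.0, rng) for _ in range(5)]
print(counts)
```

When two populations merge, the lineages remaining at the top of each branch would be pooled and the same procedure applied to the parent branch.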

3 Species Tree Estimation 

We introduce a technique which we call the Global LAteSt Split (GLASS) method. 

Inference method. Consider first the case of a single gene ($k = 1$). Looking
backwards in time, the first speciation occurs at some time $T_1$, say between
populations $r_1$ and $s_1$. It is well known that, for any sample $a$ from $\mathcal{M}_{r_1}^{(1)}$ and $b$ from
$\mathcal{M}_{s_1}^{(1)}$, the coalescence time $\mathcal{D}_{ab}^{(1)}$ between alleles $a$ and $b$ overestimates the
divergence time of the populations. As noted in [Tak89], a better estimate of $T_1$ can be
obtained by taking the smallest interspecific coalescence time between alleles in
$\mathcal{M}_{r_1}^{(1)}$ and in $\mathcal{M}_{s_1}^{(1)}$, that is, by considering instead $\mathcal{D}_{r_1 s_1}^{(1)}$.

The inference then proceeds as follows. First, cluster the two populations, say
$r_1$ and $s_1$, with smallest interspecific coalescence time $\mathcal{D}_{r_1 s_1}^{(1)}$. Define the
coalescence time of two clusters $A, B \subseteq L$ as the minimum interspecific coalescence
time between populations in $A$ and in $B$, that is,

$$\mathcal{D}_{AB}^{(1)} = \min\left\{ \mathcal{D}_{rs}^{(1)} : r \in A,\ s \in B \right\}.$$

Then, repeat as above until there is only one cluster left. This is essentially the
algorithm proposed by Rosenberg [Ros02]. In particular, Rosenberg calls the
implied topology on the populations so obtained the collapsed gene tree.

How to extend this algorithm to $k > 1$? As we discussed earlier, one could infer
a gene tree as above for each locus and take a majority vote, but this approach
fails [DR06]; in particular, it is generally not statistically consistent.

Another natural idea is to get a "better" estimate of coalescence times by
averaging across loci. This leads to the Shallowest Divergence Clustering method of
Maddison and Knowles [MK06]. We argue that a better choice is, instead, to take
the minimum across loci. In other words, we apply the clustering algorithm above
to the quantity

$$\mathcal{D}_{AB} = \min\left\{ \mathcal{D}_{AB}^{(i)} : i \in \mathcal{I} \right\},$$

for all $A, B \subseteq L$ with $A \cap B = \emptyset$. The reason we consider the minimum is similar
to the case of one locus and several samples per population above: a single pair
$a \in \mathcal{M}_r^{(i)}$, $b \in \mathcal{M}_s^{(i)}$ (for some $i$) with coalescence time $T$, taken across
all pairs of samples in populations $r$ and $s$ (one from each) and all loci in $\mathcal{I}$,
provides indisputable evidence that the corresponding species branched before time $T$
(looking backwards in time). In a sense, we build the "minimal" tree on $L$ that is
"consistent" with the evidence provided by the gene trees. This type of approach
is briefly discussed by Takahata [Tak89] in the simple case of three populations
(where the issues raised by [DR06] do not arise).

The algorithm, which we name GLASS, is detailed in Figure 1. We call the
tree so obtained the glass tree. We show in the next two sections that GLASS is in
fact statistically consistent.

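Algorithm GLASS

Input: Gene trees $\{\mathcal{G}^{(i)}\}_{i \in \mathcal{I}}$ and coalescence times $\mathcal{D}_{ab}^{(i)}$ for all $i \in \mathcal{I}$ and $a, b \in \mathcal{L}^{(i)}$;
Output: Estimated topology $S'$;

• [Intercluster coalescences] For all $A, B \subseteq L$ with $A \cap B = \emptyset$, compute

$$\mathcal{D}_{AB} = \min\left\{ \mathcal{D}_{ab}^{(i)} : i \in \mathcal{I},\ r \in A,\ s \in B,\ a \in \mathcal{M}_r^{(i)},\ b \in \mathcal{M}_s^{(i)} \right\};$$

• [Clustering] Set $\mathcal{Q} := \{\{r\} : r \in L\}$; until $|\mathcal{Q}| = 1$:

- Denote the current partition $\mathcal{Q} = \{A_1, \ldots, A_z\}$;

- Let $A', A''$ minimize $\mathcal{D}_{AB}$ over all pairs $A, B \in \mathcal{Q}$ (break ties arbitrarily);

- Merge $A'$ and $A''$ in $\mathcal{Q}$;

• [Output] Return the topology implied by the steps above.

Figure 1: Algorithm GLASS.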
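
The steps of Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `glass`, the input layout (each locus given as a pair of dictionaries, one listing the alleles sampled per population and one mapping unordered allele pairs to coalescence times), the helper `one_allele_locus` and the toy numbers are all hypothetical. The output is the list of clusters in the order they are created, which determines the topology.

```python
from itertools import combinations, product


def glass(populations, loci):
    """Return the clusters created by GLASS, in merge order."""
    # Population-level times: minimum over loci and interspecific allele pairs.
    d = {
        frozenset((r, s)): min(
            coal[frozenset((a, b))]
            for alleles, coal in loci
            for a, b in product(alleles[r], alleles[s])
        )
        for r, s in combinations(populations, 2)
    }

    def cluster_time(A, B):
        # Intercluster coalescence time: minimum over r in A and s in B.
        return min(d[frozenset((r, s))] for r, s in product(A, B))

    clusters = [frozenset([p]) for p in populations]
    merges = []
    while len(clusters) > 1:
        # Ties are broken by the order in which pairs are enumerated.
        A, B = min(combinations(clusters, 2), key=lambda pair: cluster_time(*pair))
        clusters = [c for c in clusters if c not in (A, B)] + [A | B]
        merges.append(A | B)
    return merges


def one_allele_locus(t_ab, t_ac, t_bc):
    """Hypothetical locus with one allele per population a, b, c."""
    alleles = {"a": ["a1"], "b": ["b1"], "c": ["c1"]}
    coal = {
        frozenset(("a1", "b1")): t_ab,
        frozenset(("a1", "c1")): t_ac,
        frozenset(("b1", "c1")): t_bc,
    }
    return alleles, coal


# First locus is discordant (b and c coalesce first), second is concordant.
loci = [one_allele_locus(400, 400, 300), one_allele_locus(150, 500, 500)]
merges = glass(["a", "b", "c"], loci)
print(merges)
```

In this toy input the minimum across loci gives 150 for the pair (a, b), 400 for (a, c) and 300 for (b, c), so a and b are clustered first even though the first locus alone would group b with c. The sketch recomputes intercluster times from scratch at every step, which keeps it short at the cost of efficiency.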

Multilocus concordance. A gene tree with one sample per population is said to
be concordant (sometimes also "congruent" or "consistent") with a species tree if
their (leaf-labelled) topologies agree. When the number of samples per population
is larger than one, one cannot directly compare the topology of the gene tree
with that of the species tree since they contain a different number of leaves.
Instead, Rosenberg [Ros02] defines a gene tree to be topologically concordant with
a species tree if the collapsed gene tree (see above) coincides with the species tree.

We extend Rosenberg's definition to multiple loci. We say that a collection
of gene trees $\{\mathcal{G}^{(i)}\}_{i \in \mathcal{I}}$ is multilocus concordant with a species tree $S$ if the glass
tree agrees with the species tree. Therefore, to prove that GLASS is statistically
consistent, it suffices to show that the probability of multilocus concordance goes
to 1 as the number of loci goes to $+\infty$.

4 Sufficient Conditions 

In this section, we state a simple combinatorial condition guaranteeing that GLASS
returns the correct species tree. Our condition is an extension of Takahata's
condition in the case of a single gene [Tak89]. See also [Ros02].

As before, let $S$ be a species tree and $\{\mathcal{G}^{(i)}\}_{i \in \mathcal{I}}$ a collection of gene trees. For a
subset of leaves $A \subseteq L$, denote by $\langle A \rangle$ the most recent common ancestor (MRCA)
of $A$ in $S$. For a (internal or leaf) node $v$ in $S$, we use the following notation:

• $\lfloor v \rfloor$ are the descendants of $v$ in $L$;

• $\underline{t}_v$ is the time elapsed in number of generations between $v$ and $\lfloor v \rfloor$;

• $\overline{t}_v$ is the time elapsed in number of generations between the immediate
ancestor of $v$ and $\lfloor v \rfloor$.

In particular, note that if $e$ is the branch immediately above $v$, then we have

$$t_e = \overline{t}_v - \underline{t}_v.$$

Also, we call the subtree below $v$, clade $v$.

Our combinatorial condition can be stated as follows:

(*)  $\forall u, v \in V, \quad \underline{t}_{\langle \lfloor u \rfloor \cup \lfloor v \rfloor \rangle} \le \mathcal{D}_{\lfloor u \rfloor \lfloor v \rfloor} < \overline{t}_{\langle \lfloor u \rfloor \cup \lfloor v \rfloor \rangle}.$

In words, for any two clades $u, v$, there is at least one locus $i$ and one pair of
alleles $a, b$ with $a$ from clade $u$ and $b$ from clade $v$ such that the lineages of $a$
and $b$ coalesce before the end of the branch above the MRCA of $u$ and $v$. (The
first inequality is clear by construction.) By the next proposition, condition (*)
is sufficient for multilocus concordance. Note, however, that it is not necessary.
Moreover, by design, GLASS always returns a tree, even when the condition is
not satisfied.

Proposition 1 (Sufficient Condition) Assume that (*) is satisfied. Then, GLASS
returns the correct species tree. In other words, the gene trees $\{\mathcal{G}^{(i)}\}_{i \in \mathcal{I}}$ are
multilocus concordant with the species tree $S$.

Proof: Let $\mathcal{Q}$ be one of the partitions obtained by GLASS along its execution and
let $B$ be the newly created set in $\mathcal{Q}$. We claim that, under (*), it must be the case
that

$$B = \lfloor \langle B \rangle \rfloor. \qquad (1)$$

That is, $B$ is the set of leaves of a clade in the species tree $S$. The proposition
follows immediately from this claim.

We prove the claim by induction on the execution time of the algorithm.
Property (1) is trivially true initially. Assume the claim holds up to time $T$ and let $\mathcal{Q}$,
as above, be the partition at time $T + 1$. Note that $B$ is obtained by merging two
sets $B'$ and $B''$ forming a partition of $B$. By induction, $B'$ and $B''$ satisfy (1).
Now, suppose by contradiction that $B$ does not satisfy (1). Let $\langle B \rangle'$ and $\langle B \rangle''$
be the clades immediately below $\langle B \rangle$ with corresponding leaf sets $C' = \lfloor \langle B \rangle' \rfloor$
and $C'' = \lfloor \langle B \rangle'' \rfloor$. By our induction hypothesis, each of $B'$ and $B''$ must be
contained in one of $C'$ or $C''$. Say $B' \subseteq C'$ and $B'' \subseteq C''$ without loss of generality.
Moreover, since $B$ does not satisfy (1), one of the inclusions is strict, say $B' \subset C'$.
But by (*), any set $X$ in $\mathcal{Q}$ containing an element of $C' - B'$ has

$$\mathcal{D}_{B'X} < \overline{t}_{\langle B' \cup X \rangle} \le \overline{t}_{\langle B \rangle'} = \underline{t}_{\langle B \rangle} = \underline{t}_{\langle B' \cup B'' \rangle} \le \mathcal{D}_{B'B''}. \qquad (2)$$

To justify the first two inequalities above, note that $X$ is contained in the partition
at time $T$ and therefore satisfies (1). In particular, by construction

$$B' \cup X \subseteq C'.$$

Hence by (2), GLASS would not have merged $B'$ and $B''$, a contradiction. ■

5 Statistical Consistency 

In this section, we prove the consistency of GLASS. 

Consistency. We prove the following consistency result. Note that the result
holds for any species tree, including those in the "anomaly zone" of Degnan and
Rosenberg [DR06].

Proposition 2 (Consistency) GLASS is statistically consistent.

Proof: Throughout the proof, time runs backwards as is conventional in coalescent
theory. We use Proposition 1 and give a lower bound on the probability that
condition (*) is satisfied.

Consider first the case of one locus and one sample per population. By (*),
the reconstruction is correct if every time two populations meet, the corresponding
alleles coalesce before the end of the branch immediately above. By classical
coalescent calculations (e.g. [Tav84]), this happens with probability at least

$$\left(1 - e^{-\mu}\right)^{n-1},$$

where we used the fact that there are $n - 1$ divergences.

Now consider the general case. Imagine running the coalescent processes of all
loci simultaneously. Consider any branching between two populations. In every
gene tree separately, if several alleles emerge on either side of the branching,
choose arbitrarily one allele from each side. The probability that the chosen allele
pairs fail to coalesce before the end of the branch above in all loci is at most $e^{-k\mu}$
by independence. Indeed, irrespective of everything else going on, two alleles meet
at exponential rate 1 (conditionally on the past). This finally gives a probability of
success of at least

$$\left(1 - e^{-k\mu}\right)^{n-1}.$$

For $n$ and $\mu$ fixed, we get

$$\left(1 - e^{-k\mu}\right)^{n-1} \to 1,$$

as $k \to +\infty$, as desired. ■
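
The key step of the proof, that one chosen allele pair per locus fails to coalesce within a branch in all $k$ loci with probability at most $e^{-k\mu}$, can be checked numerically. The sketch below is a Monte Carlo illustration for the extreme case of a branch of length exactly $\mu$ in coalescent units, where the bound is attained. The function name, the seed and the parameter values are arbitrary choices for this example.

```python
import math
import random


def all_loci_fail_frequency(k, mu, trials, seed=0):
    """Estimate P[no chosen pair coalesces within a branch of length mu]."""
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        # One Exp(1) coalescence waiting time per locus, independent across loci.
        if all(rng.expovariate(1.0) > mu for _ in range(k)):
            fails += 1
    return fails / trials


freq = all_loci_fail_frequency(k=3, mu=0.5, trials=20000)
exact = math.exp(-3 * 0.5)
print(freq, exact)
```

The empirical frequency should be close to $e^{-k\mu}$, which decays geometrically in the number of loci.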

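Rates. Implicit in the proof of Proposition 2 is the following convergence rate.

Proposition 3 (Rate) It holds that

$$P[\text{Multilocus Discordance}] \le (n - 1)\left[e^{-\mu}\right]^k.$$

In particular, for any $\varepsilon > 0$, taking

$$k \ge \frac{1}{\mu} \ln\left(\frac{n - 1}{\varepsilon}\right),$$

we get

$$P[\text{Multilocus Discordance}] \le \varepsilon.$$

Proof: Note that

$$1 - \left(1 - e^{-k\mu}\right)^{n-1} \le (n - 1)\left[e^{-\mu}\right]^k.$$ ■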
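
For a sense of scale, the bound of Proposition 3 can be evaluated directly. The following sketch computes the bound and the smallest integer number of loci for which it falls below a target $\varepsilon$. The function names and the sample values of $n$, $\mu$ and $\varepsilon$ are hypothetical and chosen only for illustration.

```python
import math


def discordance_bound(n, mu, k):
    """Upper bound (n - 1) * exp(-mu * k) on the discordance probability."""
    return (n - 1) * math.exp(-mu * k)


def loci_needed(n, mu, eps):
    """Smallest integer k >= 1 with (n - 1) * exp(-mu * k) <= eps."""
    return max(1, math.ceil(math.log((n - 1) / eps) / mu))


# Hypothetical example: 10 populations, shortest branch 0.1 coalescent units.
k = loci_needed(n=10, mu=0.1, eps=0.05)
print(k, discordance_bound(10, 0.1, k))
```

With these sample values, 52 loci suffice for the bound to drop below 0.05, while 51 do not. Since this is only an upper bound, fewer loci may be enough in practice.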

Multiple alleles v. multiple loci. It is interesting to compare the relative effects
of adding more alleles or more loci on the accuracy of the reconstruction. The
result in Proposition 3 does not address this question. In fact, it is hard to obtain
useful analytic expressions for small numbers of genes and alleles. However, the
asymptotic behavior is quite clear. Indeed, as was pointed out in [Ros02] (see
also [MK06] for empirical evidence), the benefit of adding more alleles eventually
wears out. This is because the probability of observing any given number of alleles
at the top of a branch is uniformly bounded in the number of alleles existing at the
bottom. More precisely, we have the following result, which is to be contrasted
with Proposition 3.

Proposition 4 (Multiple Alleles: Saturation Effect) Let $S$ be any species tree on
$n$ populations. Then, there is $0 < q^* < 1$ (depending only on $S$) such that for
any number of loci $k > 0$ and any number of alleles sampled per population, we
have

$$P[\text{Multilocus Discordance}] \ge (q^*)^k > 0.$$

In particular, for a fixed number of loci $k > 0$, as the number of alleles per
population goes to $+\infty$, the probability that GLASS correctly reconstructs $S$ remains
bounded away from 1.

Proof: Take any three populations $a, b, c$ from $S$. Assume that $a$ and $b$ meet $T_1$
generations back and that $c$ joins them $T_2$ generations later. For $w = a, b, c$ and
$i \in \mathcal{I}$, let $Y_w^{(i)}$ be the event that in locus $i$ there is only one allele remaining at the
top of the branch immediately above $w$. Let $Z^{(i)}$ be the event that the topology of
gene tree $i$ restricted to $\{a, b, c\}$ is topologically discordant with $S$. It follows from
bound (6.5) in [Tav84] that there is $0 < q' < 1$ independent of $k$ such that

$$P[Y_w^{(i)}] \ge q',$$

for all $i \in \mathcal{I}$ and $w \in \{a, b, c\}$. Also, it is clear that there is $0 < q'' < 1$ depending
on $T_2$ such that

$$P[Z^{(i)} \mid Y_w^{(i)},\ \forall w \in \{a, b, c\}] \ge q'',$$

for all $i \in \mathcal{I}$. Therefore, by independence of the loci, we have

$$P[\text{Multilocus Discordance}] \ge \prod_{i \in \mathcal{I}} P[Z^{(i)} \mid Y_w^{(i)},\ \forall w \in \{a, b, c\}] \prod_{w \in \{a, b, c\}} P[Y_w^{(i)}] \ge \left((q')^3 q''\right)^k.$$

Take $q^* = (q')^3 q''$. That concludes the proof. ■






6 Tolerance to Estimation Error 

The results of the previous section are somewhat unrealistic in that they assume 
that GLASS is given exact estimates of coalescence times. In this section, we relax 
this assumption. 

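Assume that the input to the algorithm is now a set of estimated coalescence
times

$$\left\{ \hat{\mathcal{D}}_{ab}^{(i)} \right\}_{i \in \mathcal{I},\ a, b \in \mathcal{L}^{(i)}},$$

and, for all $A, B \subseteq L$, let $\hat{\mathcal{D}}_{AB}$
be the corresponding estimated intercluster coalescence times computed by GLASS.
Assume further that there is a $\delta > 0$ such that

$$\left| \hat{\mathcal{D}}_{ab}^{(i)} - \mathcal{D}_{ab}^{(i)} \right| \le \delta,$$

for all $i \in \mathcal{I}$ and $a, b \in \mathcal{L}^{(i)}$. In particular, note that

$$\left| \hat{\mathcal{D}}_{AB} - \mathcal{D}_{AB} \right| \le \delta,$$

for all $A, B \subseteq L$.

Let $m$ be the shortest branch length in number of generations, that is,

$$m = \min\{t_e : e \in E\}.$$

We extend our combinatorial condition (*) to

(★)  $\forall u, v \in V, \quad \underline{t}_{\langle \lfloor u \rfloor \cup \lfloor v \rfloor \rangle} \le \mathcal{D}_{\lfloor u \rfloor \lfloor v \rfloor} < \overline{t}_{\langle \lfloor u \rfloor \cup \lfloor v \rfloor \rangle} - 2\delta.$

Then, we get the following.

Proposition 5 (Sufficient Condition: Noisy Case) Assume that

$$\delta < \frac{m}{2} \qquad (3)$$

and that (★) is satisfied. Then, GLASS returns the correct species tree.

Proof: The proof follows immediately from the argument in Proposition 1 by
noting that equation (2) becomes

$$\hat{\mathcal{D}}_{B'X} \le \mathcal{D}_{B'X} + \delta < \overline{t}_{\langle B' \cup X \rangle} - \delta \le \underline{t}_{\langle B' \cup B'' \rangle} - \delta \le \mathcal{D}_{B'B''} - \delta \le \hat{\mathcal{D}}_{B'B''}.$$

Condition (3) ensures that (★) is satisfiable. ■

Moreover, we have immediately: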

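Proposition 6 (Consistency & Rate: Noisy Case) Assume that

$$\delta < \frac{m}{2}.$$

Then GLASS is statistically consistent. Moreover, let

$$\Lambda = \frac{m - 2\delta}{m},$$

then it holds that

$$P[\text{Incorrect Reconstruction}] \le (n - 1)\left[e^{-\mu\Lambda}\right]^k.$$

In particular, for any $\varepsilon > 0$, taking

$$k \ge \frac{1}{\mu\Lambda} \ln\left(\frac{n - 1}{\varepsilon}\right),$$

we get

$$P[\text{Incorrect Reconstruction}] \le \varepsilon.$$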
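
The effect of the estimation error on the number of loci can be read off the factor $\Lambda$. The sketch below evaluates the requirement on $k$ from Proposition 6 for hypothetical parameter values. The function name `loci_needed_noisy` and all numbers are assumptions made for this example.

```python
import math


def loci_needed_noisy(n, mu, m, delta, eps):
    """Smallest integer k >= 1 with (n - 1) * exp(-mu * lam * k) <= eps."""
    if not 0 <= delta < m / 2:
        raise ValueError("requires 0 <= delta < m / 2")
    lam = (m - 2 * delta) / m  # fraction of the shortest branch left after noise
    return max(1, math.ceil(math.log((n - 1) / eps) / (mu * lam)))


# Hypothetical example: shortest branch of 1000 generations, error up to 100.
k_noisy = loci_needed_noisy(n=10, mu=0.1, m=1000, delta=100, eps=0.05)
k_exact = loci_needed_noisy(n=10, mu=0.1, m=1000, delta=0, eps=0.05)
print(k_noisy, k_exact)
```

With these sample values the requirement grows from 52 loci with exact times to 65 loci when the error is a tenth of the shortest branch, and it blows up as $\delta$ approaches $m/2$.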



7 Generalization 

The basic observation underlying our approach is that distances between popula- 
tions may be estimated correctly using the minimum divergence time among all 
individuals and all genes. 

Actually, this observation may be used in conjunction with any distance-based
reconstruction algorithm. (See e.g. [Fel04, SS03] for background on distance
matrix methods.) This can be done under very general assumptions as we discuss
next. First, we do away with the molecular clock assumption. Indeed, it turns out
that $\mathcal{D}_{ab}^{(i)}$ need not be the divergence time between $a$ and $b$ for gene $i$. Instead, we
take $\mathcal{D}_{ab}^{(i)}$ to be the molecular distance between $a$ and $b$ in gene $i$, that is, the time
elapsed from the divergence point to $a$ and $b$ integrated against the rate of mutation.
We require that the rate of mutation be the same for all genes and all individuals
in the same branch of the species tree, but we allow rates to differ across branches.
Below, all quantities of the type $\mathcal{D}$, $\hat{\mathcal{D}}$, etc. are given in terms of this molecular
distance.

For any two clusters $A, B \subseteq L$, we define

$$\mathcal{D}_{AB} = \min\left\{ \mathcal{D}_{ab}^{(i)} : i \in \mathcal{I},\ r \in A,\ s \in B,\ a \in \mathcal{M}_r^{(i)},\ b \in \mathcal{M}_s^{(i)} \right\}, \qquad (4)$$

as before. Let

$$m' = \min\{t_e \rho_e : e \in E\},$$

where $\rho_e$ is the rate of mutation on branch $e$. It is easy to generalize condition
(*) so that we can use (4) to estimate all molecular distances between pairs of
populations up to an additive error of, say, $m'/4$. Then using standard four-point
methods, we can reconstruct the species tree correctly.
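
As an illustration of the four-point idea, the sketch below infers the split of a quartet from pairwise distances by choosing the pairing with the smallest sum of within-pair distances. This is one simple variant of a four-point method and is not taken from the paper. The function name `quartet_split` and the toy distances are hypothetical: they come from a tree with split ab|cd, leaf branches of lengths 2, 3, 1, 4 and an internal branch of length 1, with every distance perturbed by 0.25 (one quarter of the shortest branch).

```python
def quartet_split(dist, w, x, y, z):
    """Return the pairing of {w, x, y, z} with the smallest within-pair sum."""
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]

    def within_pair_sum(pairing):
        (p, q), (r, s) = pairing
        return dist[frozenset((p, q))] + dist[frozenset((r, s))]

    return min(pairings, key=within_pair_sum)


# True distances: ab=5, cd=5, ac=4, bd=8, ad=7, bc=5; each perturbed by 0.25
# in the direction least favorable to the true split ab|cd.
dist = {
    frozenset(("a", "b")): 5.25, frozenset(("c", "d")): 5.25,
    frozenset(("a", "c")): 3.75, frozenset(("b", "d")): 7.75,
    frozenset(("a", "d")): 6.75, frozenset(("b", "c")): 4.75,
}
split = quartet_split(dist, "a", "b", "c", "d")
print(split)
```

For exact tree distances the correct pairing has a sum smaller than the other two by twice the internal path length, so an additive error of one quarter of the shortest branch on each distance cannot change the selected pairing.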

Note furthermore that by the results of [ESSW99], it suffices in fact to estimate
distances between pairs of populations that are "sufficiently close." We can derive
consistency conditions which guarantee the reconstruction of the correct species
tree in that case as well.